Using syllable-based indexing features and language models to improve German spoken document retrieval

Authors

  • Martha Larson
  • Stefan Eickeler
Abstract

Spoken document collections with high word-type/word-token ratios and heterogeneous audio continue to constitute a challenge for information retrieval. The experimental results reported in this paper demonstrate that syllable-based indexing features can outperform word-based indexing features on such a domain, and that syllable-based speech recognition language models can successfully be used to generate syllable-based indexing features. Recognition is carried out with a 5k syllable language model and a 10k mixed-unit language model whose vocabulary consists of a mixture of words and syllables. Both language models yield retrieval performance comparable to that attained with a large-vocabulary word-based language model. Experiments are performed on a spoken document collection consisting of short German-language radio documentaries. First, the vector space model is applied to a known-item retrieval task and a similar-document search. Then, the known-item retrieval task is further explored with a Levenshtein-distance-based fuzzy word match.
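The abstract does not say how the fuzzy word match is configured, so the following is only a minimal sketch of the general idea, not the authors' implementation. The function names, the length-normalized threshold `max_ratio`, and the example terms are all assumptions made for illustration. It computes the standard Levenshtein edit distance and keeps transcript terms that lie close to a query term, which is one way recognition errors in a transcript might be tolerated.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists of syllables)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def fuzzy_match(query_term, transcript_terms, max_ratio=0.34):
    """Return (term, distance) pairs within a normalized edit distance.

    max_ratio is a hypothetical threshold chosen for this sketch only.
    """
    hits = []
    for term in transcript_terms:
        dist = levenshtein(query_term, term)
        if dist / max(len(query_term), len(term), 1) <= max_ratio:
            hits.append((term, dist))
    return sorted(hits, key=lambda hit: hit[1])


if __name__ == "__main__":
    # Made-up transcript terms; "bundestak" imitates a recognition error.
    print(fuzzy_match("bundestag", ["bundestak", "bundesrat", "wetter"]))
```

Because `levenshtein` accepts any sequence, the same function could compare lists of syllables instead of character strings, which would be the natural variant for syllable-based indexing features.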

Similar resources

Improved Chinese spoken document retrieval with hybrid modeling and data-driven indexing features

Different models retrieve the documents based on different approaches of extracting the underlying content. Different levels of indexing features also offer different functionalities and discriminabilities when retrieving the documents. In this paper, we present results for Chinese spoken document retrieval with hybrid models to integrate the knowledge obtainable from three basic retrieval mode...

Retrieval of mandarin broadcast news using spoken queries

Considering the monosyllabic structure of the Chinese language, a whole class of indexing features for retrieval of Mandarin broadcast news using syllable-level statistical characteristics has been previously investigated. This paper presents the improvements achieved over the previous results. The major differences are: (1) Multi-scale character- and word-level indexing terms have been integrate...

Multi-scale and Multi-model Integratio in Chinese Spoken Docume

This paper describes our attempt to combine the relative merits of different indexing units (scales) and different retrieval models to improve performance in Chinese spoken document retrieval. Our study includes indexing units from three scales: words, character bigrams and syllable bigrams. We also include two different retrieval models: the HMM-based model and the vector space model (VSM). Ou...

An HMM/n-gram-based linguistic processing approach for Mandarin spoken document retrieval

In this paper an HMM/N-gram-based linguistic processing approach for Mandarin spoken document retrieval is presented. The underlying characteristics and different structures of this approach were extensively investigated. The retrieval capabilities were verified by tests with indexing features at the word and syllable (subword) levels and comparison with the conventional vector space model approach. ...

Publication date: 2003